Promoting user engagement and learning in amorphous search tasks
Much research in information retrieval (IR) focuses on optimization of the rank of relevant retrieval results for single shot ad hoc IR tasks. Relatively little research has been carried out on user engagement to support more complex search tasks. We seek to improve user engagement for IR tasks by providing richer representations of retrieved information. It is our expectation that this strategy will promote implicit learning within search activities. Specifically, we plan to explore methods of finding semantic concepts within retrieved documents, with the objective of creating improved document surrogates. Further, we would like to study search effectiveness in terms of different facets such as the user’s search experience, satisfaction, engagement and learning. We intend to investigate this in an experimental study, where our richer document representations are compared with the traditional document surrogates for the same user queries.
HD-Index: Pushing the Scalability-Accuracy Boundary for Approximate kNN Search in High-Dimensional Spaces
Nearest neighbor searching of large databases in high-dimensional spaces is
inherently difficult due to the curse of dimensionality. A flavor of
approximation is, therefore, necessary to practically solve the problem of
nearest neighbor search. In this paper, we propose a novel yet simple indexing
scheme, HD-Index, to solve the problem of approximate k-nearest neighbor
queries in massive high-dimensional databases. HD-Index consists of a set of
novel hierarchical structures called RDB-trees built on Hilbert keys of
database objects. The leaves of the RDB-trees store distances of database
objects to reference objects, thereby allowing efficient pruning using distance
filters. In addition to triangular inequality, we also use Ptolemaic inequality
to produce better lower bounds. Experiments on massive (up to billion scale)
high-dimensional (up to 1000+ dimensions) datasets show that HD-Index is effective, efficient, and scalable. Comment: PVLDB 11(8):906-919, 201
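The two pruning bounds mentioned in the abstract can be sketched in a few lines (an illustrative reconstruction, not code from the paper; the object coordinates and threshold below are invented):

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triangle_lb(d_qr, d_or):
    """Lower bound on d(q, o) from one reference r (triangle inequality)."""
    return abs(d_qr - d_or)

def ptolemy_lb(d_qr1, d_qr2, d_or1, d_or2, d_r1r2):
    """Lower bound on d(q, o) from two references r1, r2 (Ptolemaic inequality):
    d(q, o) >= |d(q, r1) * d(o, r2) - d(q, r2) * d(o, r1)| / d(r1, r2)."""
    return abs(d_qr1 * d_or2 - d_qr2 * d_or1) / d_r1r2

# Toy example: try to prune object o against a current kNN radius tau
# using only precomputed object-to-reference distances.
q, o, r1, r2 = (0.0, 0.0), (3.0, 4.0), (1.0, 0.0), (0.0, 1.0)
tau = 2.0
lb = max(triangle_lb(dist(q, r1), dist(o, r1)),
         ptolemy_lb(dist(q, r1), dist(q, r2),
                    dist(o, r1), dist(o, r2), dist(r1, r2)))
if lb > tau:
    print("pruned without computing d(q, o)")
```

Because both bounds never exceed the true distance, an object whose lower bound already exceeds the kNN radius can be discarded without touching its full high-dimensional vector.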
DCU-SEManiacs at SemEval-2016 task 1: synthetic paragram embeddings for semantic textual similarity
We experiment with learning word representations designed to be combined into sentence level semantic representations, using an objective function which does not directly make use of the supervised scores provided with the training data, instead opting for a simpler objective which encourages similar phrases to be close together in the embedding space. This simple objective lets us start with high quality embeddings trained using the Paraphrase Database (PPDB) (Wieting et al., 2015;
Ganitkevitch et al., 2013), and then tune these embeddings using the official STS task training data, as well as synthetic paraphrases for each test dataset, obtained by pivoting through machine translation. Our submissions include runs which only
compare the similarity of phrases in the embedding space, directly using the similarity score to produce predictions, as well as a run which uses vector similarity in addition to a suite of features we investigated for our 2015 SemEval submission.
For the cross-lingual task, we simply translate the Spanish sentences to English and use the same system we designed for the monolingual task.
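The core scoring idea, comparing phrases directly in the embedding space, reduces to averaging word vectors and taking a cosine similarity (a minimal sketch with an invented toy vocabulary; the actual system uses paragram embeddings trained on PPDB and tuned on STS data):

```python
import math

# Hypothetical tiny word-embedding table, standing in for trained
# paragram embeddings.
emb = {
    "a":   [0.1, 0.3], "man": [0.9, 0.2], "person": [0.85, 0.25],
    "is":  [0.2, 0.2], "walking": [0.3, 0.9], "strolling": [0.35, 0.85],
}

def embed(sentence):
    """Average the word vectors of in-vocabulary tokens."""
    vecs = [emb[w] for w in sentence.lower().split() if w in emb]
    return [sum(c) / len(vecs) for c in zip(*vecs)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def sts_score(s1, s2):
    """Map cosine similarity in [-1, 1] onto the 0-5 STS scale."""
    return 5.0 * max(0.0, cosine(embed(s1), embed(s2)))

print(sts_score("a man is walking", "a person is strolling"))
```

A paraphrase pair lands near 5, while unrelated sentences fall toward 0, which is how similarity scores can be used directly as predictions.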
Generalized Zero-Shot Learning via Synthesized Examples
We present a generative framework for generalized zero-shot learning where
the training and test classes are not necessarily disjoint. Built upon a
variational autoencoder based architecture, consisting of a probabilistic
encoder and a probabilistic conditional decoder, our model can generate novel
exemplars from seen/unseen classes, given their respective class attributes.
These exemplars can subsequently be used to train any off-the-shelf
classification model. One of the key aspects of our encoder-decoder
architecture is a feedback-driven mechanism in which a discriminator (a
multivariate regressor) learns to map the generated exemplars to the
corresponding class attribute vectors, leading to an improved generator. Our
model's ability to generate and leverage examples from unseen classes to train
the classification model naturally helps to mitigate the bias towards
predicting seen classes in generalized zero-shot learning settings. Through a
comprehensive set of experiments, we show that our model outperforms several
state-of-the-art methods, on several benchmark datasets, for both standard as
well as generalized zero-shot learning. Comment: Accepted in CVPR'1
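The synthesize-then-classify pipeline can be illustrated with a stand-in generator (everything here is a hypothetical simplification: a linear "decoder" replaces the paper's conditional-VAE decoder, the class attributes are invented, and nearest-centroid stands in for any off-the-shelf classifier):

```python
import random

random.seed(0)
DIM_ATTR, DIM_FEAT = 3, 4

# Stand-in for the trained probabilistic decoder: a random linear map
# from class attributes (plus noise) to feature-space exemplars.
W = [[random.gauss(0, 1) for _ in range(DIM_FEAT)] for _ in range(DIM_ATTR)]

def decode(attr, noise_scale=0.1):
    """Generate one exemplar conditioned on a class attribute vector."""
    noise = [random.gauss(0, noise_scale) for _ in range(DIM_FEAT)]
    return [sum(a * W[i][j] for i, a in enumerate(attr)) + noise[j]
            for j in range(DIM_FEAT)]

# Class attribute vectors: seen and unseen classes are treated alike.
attrs = {"zebra": [1, 0, 1], "horse": [1, 0, 0], "tiger": [0, 1, 1]}

# Synthesize exemplars for every class, including unseen ones...
train = [(decode(a), c) for c, a in attrs.items() for _ in range(20)]

# ...then fit an off-the-shelf classifier (nearest class centroid here).
centroids = {c: [sum(x[j] for x, cc in train if cc == c) / 20
                 for j in range(DIM_FEAT)] for c in attrs}

def classify(x):
    return min(centroids, key=lambda c: sum((xi - ci) ** 2
               for xi, ci in zip(x, centroids[c])))

print(classify(decode(attrs["zebra"])))
```

Because unseen classes contribute synthesized training examples just like seen ones, the classifier has no structural bias toward predicting seen classes, which is the mitigation the abstract describes.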
Position paper: promoting user engagement and learning in search tasks by effective document representation
Much research in information retrieval (IR) focuses on optimization of the rank of relevant retrieval results for single shot ad hoc IR tasks with straightforward information needs. Relatively little research has been carried out to study and support user learning and engagement for more complex search tasks. We introduce an approach intended to improve topical knowledge of a user while undertaking IR tasks. Specifically, we propose to explore methods of finding useful and informative textual units (semantic concepts) within retrieved documents, with the objective of creating improved document surrogates for presentation within the search process. We hypothesize that this strategy will promote improved implicit
learning within search activities. We believe that the richer document representations proposed in the paper would help to promote engagement, understanding and learning compared to more traditional search engine document snippets. We propose a framework for holistic evaluation of our proposed document representations and their use in search.
How do users perceive information: analyzing user feedback while annotating textual units
We describe an initial study of how participants perceive information when they categorize highlighted textual units within a document marked for a given information need. Our investigation explores how users look at different parts of a document and classify textual units within retrieved documents on four levels of relevance and importance. We compare how users classify different textual units within a document, and report the mean and variance across different users and topics. Further, we analyze and categorize the reasons provided by users while rating textual units within retrieved documents. This research yields some interesting observations regarding why some parts of a document are regarded as more relevant than others (e.g. they provide contextual information or contain background information) and which kinds of information seem to be effective for satisfying end users (e.g. showing examples, providing facts) in a search task. This work is part of our ongoing investigation into the generation of effective surrogates and document summaries based on search topics and user interactions with information.
Promoting user engagement and learning in search tasks by effective document representation
Much research in information retrieval (IR) focuses on optimisation of the rank of relevant retrieval results for single shot ad hoc IR tasks. Relatively little research has been carried out on supporting and promoting user engagement within search tasks. We seek to improve user experience by use of enhanced document snippets to be presented during the search process to promote user engagement with retrieved information. The primary role of document snippets within search has traditionally been to indicate the potential relevance of retrieved items to the user’s information need. Beyond the relevance of an item, it is generally not possible to infer the contents of individual ranked results just by reading the current snippets. We hypothesise that the creation of richer document snippets and summaries, and effective presentation of this information to users will promote effective search and greater user engagement, and support emerging areas such as learning through search.
We generate document summaries for a given query by extracting the top relevant sentences from retrieved documents. Creation of these summaries goes beyond existing snippet creation methods by comparing content between documents to take novelty into account when selecting content for inclusion in individual document summaries. Further, we investigate the readability of the generated summaries with the overall goal of generating snippets which not only help a user to identify document relevance, but are also designed to increase the user’s understanding and knowledge of a topic gained while inspecting the snippets.
We perform a task-based user study to record the user’s interactions, search behaviour and feedback to evaluate the effectiveness of our snippets using qualitative and quantitative measures. In our user study, we found that the richer snippets generated in this work improved the user experience and topical knowledge, and helped users to learn about the topic effectively.
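The novelty-aware sentence selection described above can be sketched as a greedy, MMR-style loop (a simplification assuming token-overlap similarity; the thesis compares document content with richer scoring, and the query and sentences below are invented):

```python
def tokens(s):
    return set(s.lower().split())

def overlap(a, b):
    """Jaccard overlap between the token sets of two sentences."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def summarize(query, sentences, k=2, lam=0.7):
    """Greedy selection: relevance to the query, penalized by
    similarity to sentences already chosen (novelty)."""
    chosen, pool = [], list(sentences)
    while pool and len(chosen) < k:
        best = max(pool, key=lambda s: lam * overlap(query, s)
                   - (1 - lam) * max((overlap(s, c) for c in chosen),
                                     default=0.0))
        chosen.append(best)
        pool.remove(best)
    return chosen

docs = ["solar power converts sunlight into electricity",
        "solar power converts sunlight into electric power cheaply",
        "wind turbines also generate renewable electricity"]
print(summarize("solar power electricity", docs))
```

Note how the second, near-duplicate sentence is skipped in favour of the novel third one: that is the behaviour that distinguishes these summaries from independent per-document snippets.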
DCU at FIRE 2013: cross-language !ndian news story search
We present an overview of our work carried out for DCU’s participation in the Cross Language !ndian News Story Search (CL!NSS) task at FIRE 2013. Our team submitted 3 main runs and 2 additional runs for this task. Our approach consisted of 2 steps: (1) the Lucene search engine was used with varied input query formulations using different features and heuristics designed to identify as many relevant documents as possible to improve recall; (2) document list merging and re-ranking was performed with the incorporation of a date feature. The results of our best run were ranked first among official submissions based on NDCG@5 and NDCG@10 values, and second for NDCG@1 values. For the 25 test queries, the results of our best main run were NDCG@1 0.7400, NDCG@5 0.6809 and NDCG@10 0.7268.
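Step (2), merging result lists and re-ranking with a date feature, might look like the following sketch (the sum fusion and recency-damping formula are assumptions for illustration, not the submission's exact method; all run scores and dates are invented):

```python
from datetime import date

# Hypothetical retrieval scores from two query formulations for the same
# query story; each maps doc id -> Lucene score.
run_a = {"d1": 3.2, "d2": 1.1, "d3": 2.0}
run_b = {"d2": 2.5, "d4": 1.8}
pub_dates = {"d1": date(2013, 6, 1), "d2": date(2013, 6, 20),
             "d3": date(2012, 1, 5), "d4": date(2013, 6, 25)}

def merge_and_rerank(runs, query_date, decay=0.01):
    """Merge result lists by summing scores, then damp documents whose
    publication date is far from the query story's date."""
    merged = {}
    for run in runs:
        for doc, score in run.items():
            merged[doc] = merged.get(doc, 0.0) + score
    def final(doc):
        age = abs((query_date - pub_dates[doc]).days)
        return merged[doc] / (1.0 + decay * age)
    return sorted(merged, key=final, reverse=True)

print(merge_and_rerank([run_a, run_b], date(2013, 6, 26)))
```

Here a document found by both formulations and published near the story date ("d2") outranks a higher-scoring but year-old one ("d3"), which is the intuition behind adding a date feature for news story search.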
The good, the bad and their kins: identifying questions with negative scores in StackOverflow
A rapid increase in the number of questions posted on community question answering (CQA) forums is creating a need for automated methods of question quality moderation to improve the effectiveness of such forums in terms of response time and quality. Such automated approaches should aim to classify questions as good or bad for a particular forum as soon as they are posted, based on the guidelines and quality standards defined by the forum: if a question meets the forum's standards it is classified as good, otherwise as bad. In this paper, we propose a method to address this question classification problem by retrieving similar questions previously asked in the same forum, and then using the text of these previously asked similar questions to predict the quality of the current question. We empirically validate our proposed approach on StackOverflow data, a massive CQA forum for programmers comprising about 8M questions. Using this additional text retrieved from similar questions, we improve question quality prediction accuracy by about 2.8% and the recall of negatively scored questions by about 4.2%. This improvement in recall would help to automatically flag questions as bad (unsuitable) for the forum and speed up the moderation process, saving time and human effort.
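The retrieve-similar-then-predict idea can be illustrated with a toy nearest-neighbour vote (a deliberate simplification: the paper feeds the retrieved text into a supervised classifier rather than voting, and the archive of past questions below is invented):

```python
def tokens(text):
    return set(text.lower().split())

def jaccard(a, b):
    """Token-set overlap between two questions."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

# Hypothetical archive of previously asked questions with known outcomes
# (True = positively scored / "good", False = negatively scored / "bad").
archive = [
    ("how do I merge two dicts in python", True),
    ("how to merge dictionaries python 3", True),
    ("my code does not work please help fix", False),
    ("help urgent code broken not working", False),
]

def predict_quality(question, k=2):
    """Retrieve the k most similar archived questions and take a
    majority vote over their known quality labels."""
    ranked = sorted(archive, key=lambda q: jaccard(question, q[0]),
                    reverse=True)
    votes = [label for _, label in ranked[:k]]
    return votes.count(True) >= votes.count(False)

print(predict_quality("merge two python dicts"))
print(predict_quality("please help my code is broken"))
```

A newly posted question thus inherits evidence from how the forum received similar questions in the past, which is the signal the paper exploits at posting time, before any answers or votes exist.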